The objective of this project is to build a recommender system based on a knowledge graph. The database contains fields from candidates' resumes, with information about their skills, educational institutions, level of education, etc. From this information we create four fields for the knowledge graph: 'University', 'Degree Type', 'Degree Level' and 'Skills'. Once these fields are processed and fed into the graph, a similarity matrix is prepared, and the recommender system can then suggest the top-n matches for a given candidate ID. The pre-processing of the provided dataset is done with the pandas and spaCy libraries, the graph is constructed with networkx, and the similarity between candidates is calculated with the SimRank algorithm. At the end we also manually inspect how similar the recommended candidates are to the given candidate.
import timeit
start_time = timeit.default_timer()
import sys
import os
import numpy as np
import pandas as pd
import json
import spacy
import networkx as nx
from spacy.lang.en import English
import matplotlib.pyplot as plt
%matplotlib inline
os.getcwd()
sys.version
os.chdir('C:/Users/saurabh/Desktop/Knowledge Graph')
dataframe= pd.read_json("Filtered01.json")
df = pd.read_json(dataframe['structuredLayout'].to_json(), orient="index")
df['Details']=df.Details.astype(str)
df.drop(['Extracurricular','Interests', 'Profile', 'Reference', 'Skills','Experience'], inplace=True, axis = 1)
df['University'] = dataframe['universties']
df['Skills'] = dataframe['skillsCluster']
df['Degree'] = dataframe['degrees']
Copy the university column along with the index into a new dataframe
uni_list = []
for index, uni in df['University'].items():
    for u in uni.keys():
        temp = [index, u]
        uni_list.append(temp)
df_uni = pd.DataFrame(uni_list, columns = ['index', 'University'])
df.drop(['University'], inplace=True, axis = 1)
In the original dataframe the universities are already sorted by university ranking, and we only want the first value (the highest-ranked university). So we create a new column 'Match', which is True if the value in the index column repeats the previous row's value and False otherwise.
df_uni["Match"]= df_uni["index"] == df_uni.shift()["index"]
Drop all the rows where the value of 'Match' is True
df_uni.drop(df_uni[df_uni.Match == True].index, inplace=True)
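The shift-based flag above keeps only the first (highest-ranked) university per candidate. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical (index, university) pairs, already sorted by ranking within each index
df_uni = pd.DataFrame({'index': [0, 0, 1, 2, 2],
                       'University': ['MIT', 'UCL', 'NUS', 'TUM', 'LMU']})

# Flag rows whose 'index' repeats the previous row's value ...
df_uni['Match'] = df_uni['index'] == df_uni.shift()['index']
# ... and keep only the first university per candidate
df_uni = df_uni[~df_uni.Match]
# Equivalent shortcut: df_uni.drop_duplicates(subset='index', keep='first')
```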
Join the extracted universities back onto the original dataframe: setting 'index' as the index of df_uni lets join align it with the index of df, after which the 'index' column is no longer needed.
df_2 = df.join(df_uni.set_index('index'))
Now that the universities have been merged in, drop the helper column 'Match'
df_2.drop(['Match'], inplace= True, axis = 1)
The following code extracts the degree level and the degree type of each candidate. Worth mentioning here that the first two words of the string have already been extracted, so the following for loop stores that extracted part along with the index. We will combine this dataframe with the main dataframe at the end, before passing it to the graph.
degrees = []
for i, j in df_2['Degree'].items():
    for k, l in j:
        temp = [i, l, k]
        degrees.append(temp)
df_deg = pd.DataFrame(degrees, columns = ['index','type', 'level'])
# Clean the text
import re
def clean_text(text):
    text = text.replace('\n', ' ')                   # replace newlines with spaces
    text = text.replace('/', ' ')                    # replace forward slashes with spaces
    text = re.sub(r'[^a-zA-Z ^0-9]', '', str(text))  # keep only letters, digits and spaces
    text = text.lower()                              # lower case
    text = re.sub(r'(x.[0-9])', '', text)            # remove escape-sequence leftovers like 'xa0'
    return text
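A quick spot check of the cleaning behaviour (the sample string is hypothetical; the function is restated compactly so the snippet is self-contained):

```python
import re

def clean_text(text):
    text = text.replace('\n', ' ').replace('/', ' ')  # newlines and slashes become spaces
    text = re.sub(r'[^a-zA-Z ^0-9]', '', str(text))   # drop punctuation
    text = text.lower()
    return re.sub(r'(x.[0-9])', '', text)             # drop escape-sequence leftovers

clean_text("B.Sc. (Hons)\nComputer/Science")  # 'bsc hons computer science'
```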
Apply the function clean_text() to the column 'type'
df_deg['type'] = df_deg.apply(lambda x: clean_text(x['type']), axis=1)
In the following part, each sentence is split into individual tokens so that the type of the degree can be extracted
# Initialize the tokenizer
from spacy.tokenizer import Tokenizer
nlp = spacy.load("en_core_web_sm")
tokenizer = Tokenizer(nlp.vocab)
The for loop reads the datapoints row-wise, checks that a token is not empty, and stores the extracted tokens in a list called 'tokens'
tokens = []
for doc in tokenizer.pipe(df_deg['type'], batch_size=500):
    doc_tokens = []
    for token in doc:
        if token.text != ' ':
            doc_tokens.append(token.text)
    tokens.append(doc_tokens)
Create a new column with the stored tokens
df_deg['token_type'] = tokens
df_deg['token_type'].head(20)
In order to extract tokens efficiently, it is important to select the indices in a meaningful way. From the cell output above it is hard to find a general rule, but positions 2 and 3 give the type of the degree in most cases. It is also important not to extract too many tokens, since the dataframe will later be passed to the graph and each degree type will act as an individual node.
df_deg['token_type_selected'] = [i[2:4] for i in df_deg['token_type']]
After extracting the tokens at those positions, a few tokens are still not meaningful, and the degree type needs to be standardised. The following list contains all the potential degree types; each extracted token is compared against it, and a new column is created from the matches.
# Degree type list
degree_type = ['computer', 'data', 'science', 'information', 'technology', 'architecture',
               'management', 'electrical', 'business', 'administration', 'engineering',
               'analytics', 'application', 'computing', 'digital', 'marketing',
               'food', 'beverage', 'chemistry', 'health', 'statistics',
               'analysis', 'mechanical', 'accounting', 'mathematics', 'electronics',
               'telecommunication', 'property', 'marine', 'chemical',
               'construction', 'arts', 'law', 'legal', 'network', 'media', 'security',
               'education', 'project', 'system', 'anthropology', 'sociology', 'design',
               'aviation', 'state', 'economics', 'physics', 'industrial', 'human',
               'archinformation', 'commerce', 'psychology', 'software', 'translation']
df_deg['type'] = df_deg.apply(lambda x: list(set(x['token_type_selected']) & set(degree_type)), axis=1)
The newly created column has a list in each row; the following code converts each row's items to a single string, which is required to treat them as an individual node
df_deg['type'] = [' '.join(str(x) for x in i) for i in df_deg['type']]
In the degree dataframe there are several variants of the same degree: bachelor appears as 'bechelors in', 'bachelors of', 'bachelors', etc., and similarly for master and diploma. The following function therefore looks for the substrings 'bac', 'mas' and 'dip': if 'bac' occurs, the string is normalised to 'bachelor'; 'mas' becomes 'master'; 'dip' becomes 'diploma'; and anything else becomes None. Before applying the function, we convert the column to string so that string matching can be performed.
def process_degree(s):
    if 'bac' in s:
        s = s[:s.rindex('bac')] + 'bachelor'
    elif 'mas' in s:
        s = s[:s.rindex('mas')] + 'master'
    elif 'dip' in s:
        s = s[:s.rindex('dip')] + 'diploma'
    else:
        s = None
    return s
df_deg['level'] = df_deg['level'].apply(str)
df_deg['level'] = df_deg.apply(lambda x: process_degree(x['level']), axis=1)
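A few spot checks of the normalisation (sample strings are hypothetical; the function is restated compactly so the snippet is self-contained):

```python
def process_degree(s):
    if 'bac' in s:
        return s[:s.rindex('bac')] + 'bachelor'
    elif 'mas' in s:
        return s[:s.rindex('mas')] + 'master'
    elif 'dip' in s:
        return s[:s.rindex('dip')] + 'diploma'
    return None

process_degree('bachelors in')  # 'bachelor'
process_degree('masters of')    # 'master'
process_degree('postgraduate')  # None
```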
Finally drop the unnecessary columns
df_2.drop(['Degree'], inplace= True, axis = 1)
df_2.drop(['Education'], inplace=True, axis = 1)
df_deg.drop(['token_type'], inplace= True, axis = 1)
df_deg.drop(['token_type_selected'], inplace= True, axis = 1)
The following list contains all the potential technical skills; each candidate's skills are compared against it, and only the matches are kept.
# Tech terms list
tech_terms = ['python', 'r', 'sql', 'hadoop', 'spark', 'java', 'sas', 'tableau','mysql',
'hive', 'scala', 'aws', 'c', 'c++', 'matlab', 'tensorflow', 'excel','angular',
'nosql', 'linux', 'azure', 'scikit', 'machine learning', 'statistic',
'analysis', 'computer science', 'visual', 'ai','artificial intelligence', 'deep learning','mongodb',
'nlp', 'natural language processing', 'neural network', 'mathematic',
'database', 'oop', 'blockchain','cloud', 'bootstrap', 'unix','agile',
'html', 'css', 'javascript', 'jquery', 'git', 'photoshop', 'illustrator',
'word press', 'seo', 'responsive design', 'php', 'mobile', 'design', 'react',
'security', 'ruby', 'fireworks', 'json', 'node', 'express', 'redux', 'ajax',
'java', 'api','ios','big data','php','adobe','assembly','wireframe','couchdb',
'ui prototype', 'ux writing', 'interactive design','iot','ruby on rails',
'metric', 'analytic', 'ux research', 'mockup', 'c#','web development',
'prototype', 'test', 'ideate', 'usability', 'high-fidelity design', 'karma',
'framework','testing', 'xml','oracle','node.js','scrum','uml','database management',
'autocad','swift', 'xcode', 'spatial reasoning', 'human interface', 'core data',
'grand central', 'network', 'objective-c', 'foundation', 'uikit', 'asp.net',
'cocoatouch', 'spritekit', 'scenekit', 'opengl', 'metal','data engineering',
'dreamweaver','statistical analysis','coding','basic','logic','docker',
'ms access','computer vision','html5','sed','abap']
df_2['Skills'] = df_2.apply(lambda x: list(set(x['Skills']) & set(tech_terms)), axis=1)
The explode function takes each item of a list and separates the items into individual rows, keeping the same index. This is needed because when the dataframe is passed to the graph, we want a separate node for each skill.
df_final = df_2.explode('Skills')
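The behaviour described above can be sketched on a hypothetical one-row dataframe:

```python
import pandas as pd

# Hypothetical candidate '1315' with two skills
demo = pd.DataFrame({'Details': ['1315'], 'Skills': [['python', 'sql']]})
exploded = demo.explode('Skills')
# Both resulting rows keep index 0, so each skill stays linked to the candidate
```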
df_final['Edge'] = ['edge'] * len(df_final)
Create an index column, which will be used to join the degree dataframe with the final dataframe
df_final.reset_index(inplace=True)
Merge the degree dataframe into the final dataframe with a left join: we need to keep every row of the final dataframe, otherwise we would get rows only from the degree dataframe and lose the rest.
df_final = df_final.merge(df_deg, on='index', how='left')
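The effect of `how='left'` can be seen on two small hypothetical frames:

```python
import pandas as pd

left = pd.DataFrame({'index': [0, 1, 2], 'Skills': ['python', 'sql', 'java']})
right = pd.DataFrame({'index': [0, 2], 'level': ['bachelor', 'master']})

merged = left.merge(right, on='index', how='left')
# All three left rows survive; index 1 simply gets NaN for 'level'
```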
In order to enhance the readability of the nodes, a few of the strings can be replaced to make them more meaningful
df_final['type'] = df_final['type'].replace ({'science computer':'computer science',
'technology information':'information technology', 'system computer':'computer system',
'application computer':'computer application', 'management project':'project management',
'technology electronics':'electronics technology', 'science electrical':'electrical science',
'science data':'data science', 'technology computer':'computer technology',
'science information':'information science','science computing':'computing science'})
While processing the degree type, empty strings were introduced wherever there was no match; these are converted to NaN so that the rows can be dropped.
df_final['type'] = df_final['type'].replace('', np.nan, regex=True)
Drop the helper 'index' column and any rows with missing values
df_final.drop(['index'], inplace=True, axis = 1)
df_final.dropna(inplace=True)
To build the knowledge graph from the columns Skills, University and Degree, we first build individual graphs for Skills and University and combine them, then combine the graphs for degree level and degree type, and finally combine the two composite graphs so that the final graph contains all the fields.
kg_df = pd.DataFrame({'source' : df_final['Skills'], 'target': df_final['Details'], 'edge':df_final['Edge']})
G_skills = nx.from_pandas_edgelist(kg_df, "source", "target", edge_attr = True, create_using = nx.DiGraph())
fig, ax = plt.subplots(figsize=(30, 40), dpi=80)
pos = nx.spring_layout(G_skills)
nx.draw(G_skills, with_labels=True,node_size= 4500, node_color= 'skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
fig.savefig('skills_KG.png')
kg_df = pd.DataFrame({'target' : df_final['Details'], 'source': df_final['University'], 'edge':df_final['Edge']})
G_uni = nx.from_pandas_edgelist(kg_df, "source", "target", edge_attr = True, create_using = nx.DiGraph())
fig, ax = plt.subplots(figsize=(30, 40), dpi=80)
pos = nx.spring_layout(G_uni)
nx.draw(G_uni, with_labels=True,node_size= 4500, node_color= 'skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
fig.savefig('uni_KG.png')
kg_df = pd.DataFrame({'source' : df_final['level'], 'target': df_final['Details'], 'edge':df_final['Edge']})
G_degree_level = nx.from_pandas_edgelist(kg_df, "source", "target", edge_attr = True, create_using = nx.DiGraph())
fig, ax = plt.subplots(figsize=(30, 40), dpi=80)
pos = nx.spring_layout(G_degree_level)
nx.draw(G_degree_level, with_labels=True,node_size= 4500, node_color= 'skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
fig.savefig('degree_KG.png')
kg_df = pd.DataFrame({'source' : df_final['type'], 'target': df_final['Details'], 'edge':df_final['Edge']})
G_degree_type = nx.from_pandas_edgelist(kg_df, "source", "target", edge_attr = True, create_using = nx.DiGraph())
fig, ax = plt.subplots(figsize=(30, 40), dpi=80)
pos = nx.spring_layout(G_degree_type)
nx.draw(G_degree_type, with_labels=True,node_size= 4500, node_color= 'skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
fig.savefig('degree_field KG.png')
In order to construct the final graph, we first compose the skills and university graphs, then compose the degree-level and degree-type graphs, and finally compose the two results
G_combined = nx.compose(G_skills,G_uni)
G_combined_2 = nx.compose(G_degree_level,G_degree_type)
G_final = nx.compose(G_combined,G_combined_2)
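`nx.compose` takes the union of the nodes and edges of two graphs, so candidate nodes shared between the subgraphs are merged. A toy example with hypothetical node names:

```python
import networkx as nx

g_skill = nx.DiGraph([('python', '1315')])  # skill -> candidate
g_uni = nx.DiGraph([('mit', '1315')])       # university -> candidate

g = nx.compose(g_skill, g_uni)
sorted(g.nodes())  # ['1315', 'mit', 'python'] -- the candidate node is shared
```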
fig, ax = plt.subplots(figsize=(30, 40), dpi=80)
pos = nx.spring_layout(G_final)
nx.draw(G_final, with_labels=True,node_size= 4500, node_color= 'skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
fig.savefig('Final_KG.png')
To calculate the similarity, we use the SimRank algorithm, whose intuition is that "two objects are considered to be similar if they are referenced by similar objects."
sim_final = nx.similarity.simrank_similarity(G_final)
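A minimal sketch of what SimRank produces, on a toy graph where two candidates are referenced by the same skill node (node names are hypothetical):

```python
import networkx as nx

# Two candidates 'a' and 'b' referenced by the same skill node
g = nx.DiGraph([('python', 'a'), ('python', 'b')])
sim = nx.simrank_similarity(g)  # dict of dicts: sim[u][v]

# Every node is maximally similar to itself; 'a' and 'b' share their only
# in-neighbour, so their score equals the importance factor (default 0.9)
```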
from heapq import nlargest
def find_similarity_final(key, graph):
    if key in sim_final:
        # nlargest(4, ...) includes the candidate itself, so return the next three
        top3 = nlargest(4, sim_final.get(key), key=sim_final.get(key).__getitem__)
        return top3[1:]
    else:
        return 'key does not exist'
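The nlargest call above can be illustrated on a toy score dictionary (hypothetical values):

```python
from heapq import nlargest

# Hypothetical similarity scores for candidate '1315' (the self-score is 1.0)
scores = {'1315': 1.0, '751': 0.62, '1884': 0.55, '24': 0.48}

top4 = nlargest(4, scores, key=scores.__getitem__)
top4[1:]  # ['751', '1884', '24'] -- the top 3 excluding the candidate itself
```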
find_similarity_final('1315', G_final)
Similarity scores of the recommended candidates with the given candidate
print(sim_final.get("1315").__getitem__('751'))
print(sim_final.get("1315").__getitem__('1884'))
print(sim_final.get("1315").__getitem__('24'))
pd.options.display.max_colwidth = 100
df_final['Degree_level']=df_final['level']
df_final['Degree_type']=df_final['type']
df_final.drop(['level','type'], inplace=True, axis = 1)
df_final[df_final['Details'] == '1315']
df_final[df_final['Details'] == '751']
df_final[df_final['Details'] == '1884']
df_final[df_final['Details'] == '24']
It can be observed that the recommended candidates 751, 1884 and 24 have many commonalities with the given candidate 1315, so the model is performing satisfactorily. The project certainly has limitations as well. First of all, we could still explore other attributes, such as interests and certifications, which would likely improve the accuracy. Using a knowledge graph for recommendation has the potential to use datapoints that may otherwise remain unused or unseen, so exploring these attributes in the future could be beneficial.
Time taken for the complete model to run
elapsed = timeit.default_timer() - start_time
print("{:.2f} seconds".format(elapsed))